IEEE TRANSACTIONS ON SIGNAL PROCESSING (REVISED) 



o 



X 



Sparse Volterra and Polynomial Regression 
Models: Recoverability and Estimation 

Vassilis Kekatos, Member, IEEE, and Georgios B. Giannakis*, Fellow, IEEE 



O 

CN ■ Abstract 
C/3 



Volterra and polynomial regression models play a major role in nonlinear system identification and 
inference tasks. Exciting applications ranging from neuroscience to genome-wide association analysis 
build on these models with the additional requirement of parsimony. This requirement has high interpre- 
tative value, but unfortunately cannot be met by least-squares based or kernel regression methods. To this 
end, compressed sampling (CS) approaches, already successful in linear regression settings, can offer a 
viable alternative. The viability of CS for sparse Volterra and polynomial models is the core theme of 
this work. A common sparse regression task is initially posed for the two models. Building on (weighted) 
Lasso-based schemes, an adaptive RLS-type algorithm is developed for sparse polynomial regressions. 

ON 

\Q ' The identifiability of polynomial models is critically challenged by dimensionality. However, following 

r» : 

the CS principle, when these models are sparse, they could be recovered by far fewer measurements. To 
quantify the sufficient number of measurements for a given level of sparsity, restricted isometry properties 

o 

(RIP) are investigated in commonly met polynomial regression settings, generalizing known results for 
their linear counterparts. The merits of the novel (weighted) adaptive CS algorithms to sparse polynomial 
modeling are verified through synthetic as well as real data tests for genotype-phenotype analysis. 
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I. Introduction 

Nonlinear systems with memory appear frequently in science and engineering. Pertinent application 
areas include physiological and biological processes Q, power amplifiers J2j, loudspeakers ||3TI . speech, 
and image models, to name a few; see e.g., |[T6l . If the nonlinearity is sufficiently smooth, the Volterra 
series offers a well-appreciated model of the output expressed as a polynomial expansion of the input 
using Taylor's theorem ll20l . The expansion coefficients of order P > 1 are P-dimensional sequences of 
memory L generalizing the one-dimensional impulse response sequence encountered with linear systems. 
However, polynomial expansions of nonlinear mappings go beyond filtering. Polynomial regression aims 
at approximating a multivariate nonlinear function via a polynomial expansion (131 . Apart from its 
extensive use for optical character recognition and other classification tasks ||23l , (generalized) polynomial 
regression has recently emerged as a valuable tool for revealing genotype-phenotype relationships in 
genome- wide association (GWA) studies H, (271, EU, (181 . 

Volterra and polynomial regression models are jointly investigated here. Albeit nonlinear, their input- 
output (I/O) relationship is linear with respect to the unknown parameters, and can thus be estimated 
via linear least-squares (LS) COD, fT3l . The major bottleneck is the "curse of dimensionality," since the 
number of regression coefficients M grows as 0(L P ). This not only raises computational and numerical 
stability challenges, but also dictates unpractically long data records N for reliable estimation. One 
approach to coping with this dimensionality issue it to view polynomial modeling as a kernel regression 
problem (TT1, S ED- 

However, various applications admit sparse polynomial expansions, where only a few, say s out of M, 
expansion coefficients are nonzero - a fact that cannot be exploited via polynomial kernel regression. The 
nonlinearity order, the memory size, and the nonzero coefficients may all be unknown. Nonetheless, the 
polynomial expansion in such applications is sparse - an attribute that can be due to either a parsimonious 
underlying physical system, or an over-parameterized model assumed. Sparsity in polynomial expansions 
constitutes the motivation behind this work. Volterra system identification and polynomial regression 
are formulated in Section JI] After explaining the link between the two problems, several motivating 
applications with inherent sparse polynomial structure are provided. 

Section [Till deals with the estimation of sparse polynomial expansions. Traditional polynomial filtering 
approaches either drop the contribution of expansion terms a fortiori, or adopt the sparsity-agnostic 
LS estimator |fl6l . Alternative estimators rely on: estimating a frequency-domain equivalent model; 
modeling the nonlinear filter as the convolution of two or more linear filters; transforming the polynomial 
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representation to a more parsimonious one (e.g., using the Laguerre expansion); or by estimating fewer 
coefficients and then linearly interpolating the full model; see |[T6l and references thereoff. However, the 
recent advances on compressive sampling JS], and the least-absolute shrinkage and selection operator 
(Lasso) |[25l offer a precious toolbox for estimating sparse signals. Sparse Volterra channel estimators 
are proposed in ITT31 and ifTTl . Building on well-established (weighted) Lasso estimators E51 , ||32l , and 
their efficient coordinate descent implementation lTT2l . the present paper develops an adaptive RLS-type 
sparse polynomial estimation algorithm, which generalizes CQ to the nonlinear case, and constitutes the 
first contribution. 

Performance of the (weighted) Lasso estimators has been analyzed asymptotically in the number of 
measurements N iTTOl . |[32l . With finite samples, identifiability of Lasso-based estimators and other com- 
pressive sampling reconstruction methods can be assessed via the so-called restricted isometry properties 
(RIP) of the involved regression matrix ||6], |@J. It has been shown that certain random matrix ensembles 
satisfy desirable properties with high probability when iV scales at least as slog(M/s) Q. For Gaussian, 
Bernoulli, and uniform Toeplitz matrices appearing in sparse linear filtering, the lower bound on N has 
been shown to scale as s 2 logM lfl4l . ll22l . Section HV-AI deals with RIP analysis for Volterra filters, 
which is the second contribution of this work. It is shown that for a uniformly distributed input, the 
second-order Volterra filtering matrix satisfies the RIP with high probability when N scales as s 2 log M, 
which extends the bound from the linear to the Volterra filtering case. 

The third contribution is the RIP analysis for the sparse polynomial regression setup (Section HV-Bl i. 
Because there are no dependencies across rows of the involved regression matrix, different tools are 
utilized and the resultant RIP bounds are stronger than their Volterra filter counterparts. It is proved that 
for a uniform input, s-sparse linear-quadratic regression requires a number of measurements that scales 

s log 4 L. The same result holds also for a model oftentimes employed for GWA analysis. 

Applicability of the existing batch sparse estimators and their developed adaptive counterparts is 
demonsttated through numerical tests in Section [V] Simulations on synthetic and real GWA data show that 
sparsity-aware polynomial estimators can cope with the curse of dimensionality and yield parsimonious 
yet accurate models with relatively short data records. The work is concluded in Section |VT] 

Notation: Lower-(upper-)case boldface letters are reserved for column vectors (matrices), and calli- 
graphic letters for sets; ljy denotes the all-ones vector of length N; (-) T denotes transposition; jV(m, S) 
stands for the multivariate Gaussian probability density with mean m and covariance matrix S; E[-] 
denotes the expectation operator; ||x|| p := (X^iLi l^il^) 1 ^ for j> > 1 stands for the ^ p -norm in W 1 , and 
||x||o the ^o-(P seu do)norm, which equals the number of nonzero entries of x. 
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II. Problem Formulation: Context and Motivation 

Nonlinear system modeling using the Volterra expansion as well as the more general notion of 
(multivariate) polynomial regression are reviewed in this section. For both problems, the nonlinear I/O 
dependency is expressed in the standard (linear with respect to the unknown coefficients) matrix-vector 
form. After recognizing the "curse of dimensionality" inherent to the involved estimation problems, 
motivating applications admitting (approximately) sparse polynomial representations are highlighted. 

A. Volterra Filter Model 

Consider a nonlinear, discrete-time, and time-invariant I/O relationship y(n) = f (x(n), . . . , x(l)), 
where x(n) and y(n) denote the input and output samples at time n. While such nonlinear mappings can 
have infinite memory, finite-memory truncation is adopted in practice to yield y(n) = f (xi(ra)), where 
xi(n) := [x(n) . . . x(n — L + 1)] T with L finite. Under smoothness conditions, this I/O relationship 

can be approximated by a Volterra expansion oftentimes truncated to a finite order P as 

p 

y(n) = J2 H piMn)}+v(n) (1) 

where v(n) captures unmodeled dynamics and observation noise, assumed to be zero-mean and inde- 
pendent of Xi(ra) as well as across time; and H p [xi(ra)] denotes the output of the so-termed p-th order 
Volterra module h p (k\, . . . , k p ) given by 

L-1 L-1 p 

H p [x x (n)] := . . . h p {ki, . . . , k p ) Y[x(n - h) (2) 

fc 1= k p =0 i=l 

where memory L has been considered identical for all modules without loss of generality. The Volterra 
expansion in fl])-© has been thoroughly studied in its representation power and convergence properties; 
see e.g., |[20l , |fT6ll , and references therein. 

The goal here is to estimate h p {k\, . . . , k p ) for p = 0, 1, . . . , P, and fcj = 0, 1, . . . , L — 1, given the I/O 
samples {xi {n),y(n)}^ =1 , and upper bounds on the expansion order P and the memory size L. Although 
this problem has been extensively investigated |[T6l , the sparsity present in the Volterra representation of 
many nonlinear systems will be exploited here to develop efficient estimators. 

To this end, (fT)) will be expressed first in a standard matrix-vector form |[T6l . Define the vectors 
x p (n) := Xp_i(re) ® xi(ra) for p > 2, where ® denotes the Kronecker product; and write the p-th order 
Volterra output as H p [xi(n)] = (n)h p , where h p contains the coefficients of h p (k\, . . . , k p ) arranged 
accordingly. Using the latter, CO can be rewritten as 

y(n) = x T (n)h + v(n), n = l,...,N (3) 
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where x(n) := [l xf (n) . . . Xp(rt)] , and h := [ho hj . . . hp] . Concatenating (Q} for all n, one 
arrives at the linear model 

y = Xh + v (4) 
where y := [y(l) • • • y(N)f, X := [x(l) . . . x(N)f , and v := [v(l) . . . v(N)f. 

B. Polynomial Regression Model 

Generalizing the Volterra filter expansion, polynomial regression aims at approximating a nonlinear 
function y(n) = f ({^K 71 )}^ 1 ) °f L variables through an expansion similar to (H)-©, where the 
input vector xi(n) is now defined as xi(n) := [xo(n) . . . rE£_i(n)] T , and n is not necessarily a time 
index. Again the goal is to estimate h p (k\, . . . ,k p ) given {xi(n), y(n)}^ =1 . Polynomial regression can 
be interpreted as the P-th order Taylor series expansion of / (xi(n)), and appears in several multilinear 
estimation and prediction problems in engineering, natural sciences, and economics |[T3ll . 

By simply choosing xi(n) = x(n — I) for / = 0, . . . , L — 1, the Volterra filter is a special case of 
polynomial regression. Since this extra property has not been exploited in deriving fl])-©, these equations 
carry over to the polynomial regression setup. For this reason, the same notation will be used henceforth 
for the two setups; the ambiguity will be easily resolved by the context. 

C. The Curse of Dimensionality 

Estimating the unknown coefficients in both the Volterra system identification and in polynomial 
regression is critically challenged by the curse of dimensionality. The Kronecker product defining x p (n) 
imply that the dimension of h p is L p , and consequently h and x(n) have dimension Yl p =o ^ P = 
(L p+1 — TJ / (L — 1). Note that all possible permutations of the indices {k\, . . . , k p } multiply the same 
input term x^(n) ■ ■ ■ (n); e.g., /12(f), 1) and h2(l,0) both multiply the monomial xo{n)x\{n). To 
obtain a unique representation of ©, only one of these permutations is retained. After discarding the 
redundant coefficients, the dimension of h p and x p (n)'s is reduced to { L+ p ~ l ) ED . Exploiting such 
redundancies in modules of all orders eventually shortens h and x(n)'s to dimension 




which still grows fast with increasing L and P. For notational brevity, h and X will denote the shortened 
versions of the variables in dD; that is matrix X will be N x M. 
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D. Motivating Applications 

Applications are outlined here involving models that admit (approximately) sparse polynomial represen- 
tations. When P and L are unknown, model order selection can be accomplished via sparsity-cognizant 
estimators. Beyond this rather mundane task, sparsity can arise due to problem specifications, or be 
imposed for interpretability purposes. 

A special yet widely employed Volterra model is the so-called linear-nonlinear-linear (LNL) one fTo*! . It 
consists of a linear filter with impulse response {h a (k)}^l^ 1 , in cascade with a memoryless nonlinearity 
f(x), and a second linear filter {^&(fc)}jfc=rj 1- ^ ne overau memory is thus L = L a + — 1. If f(x) is 
analytic on an open set (a, b), it accepts a Taylor series expansion f(x) = Y^=$ c v xP * n x e ( a >^)- It 
can be shown that the p-th order redundant Volterra module is given by |[T6] Ch. 2] 

h p (ki, ...,kp) = c p 'y] h b {k)h a {ki - k) . . . h a (k p - k) (6) 

fc=0 

for ki £ {0, . . . , L — 1}. In ©, there are p-tuples (ki, . . . , k p ) for which there is no k G {0, . . . , — 1} 
such that {ki — k) G {0, . . . , L a — 1} for all i = 1, . . . ,p. For these p-tuples, the corresponding Volterra 
coefficient is zero. As an example, for filters of length L a = L b = 6 and for P = 3, among the 364 
non-redundant Volterra coefficients, the nonzero ones are no more than 224. When L a and L b are not 
known, the locations of the zero coefficients cannot be determined a priori. By dropping the second 
linear filter in the LNL model, the Wiener model is obtained. Its Volterra modules follow immediately 
from © and have the separable form h p (k\, . . . , k p ) = c p h a {k{) . . . h a (k p ) for every p |[T6l . Likewise, 
by ignoring the first filter, the LNL model is transformed to the so-called Hammerstein model in which 
h p (ki, . . . ,k p ) = c p hb(k) for k = ki = . . . = k p ; and otherwise. The key observation in all three 
models is that if at least one of the linear filters is sparse, the resulting Volterra filter is even sparser. 

That is usually the case when modeling the nonlinear behavior of loudspeakers and high-power 
amplifiers (HPA) |fT6l , El . When a small-size (low-cost) loudspeaker is located close to a microphone 
(as is the case in cellular phones, teleconferencing, hands-free, or hearing aid systems), the loudspeaker 
sound is echoed by the environment before arriving at the microphone. A nonlinear acoustic echo 
canceller should adaptively identify the impulse response comprising the loudspeaker and the room, and 
thereby subtract undesirable echoes from the microphone signal. The cascade of the loudspeaker, typically 
characterized by a short memory LNL or a Wiener model, and the typically long but (approximately) 
sparse room impulse response gives rise to a sparse Volterra filter ||311 . Similarly, HPAs residing at the 
transmitters of wireless communication links are usually modeled as LNL structures having only a few 
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coefficients contributing substantially to the output JU p. 60]. When the HPA is followed by a multipath 
wireless channel represented by a sparse impulse response, the overall system becomes sparse too iTTTl . 

Sparse polynomial expansions are also encountered in neuro science and bioinformatics. Volterra filters 
have been adopted to model causal relationships in neuronal ensembles using spike-train data recorded 
from individual neurons J3], (24]. Casting the problem as a probit Volterra regression, conventional 
model selection techniques have been pursued to zero blocks of Volterra expansion coefficients, and thus 
reveal neuron connections. Furthermore, genome-wide association (GWA) analysis depends critically on 
sparse polynomial regression models ||9], ||27| , 11281 . Through GWA studies, geneticists identify which 
genes determine certain phenotypes, e.g., human genetic diseases or traits in other species. Analysis has 
revealed that genetic factors involve multiplicative interactions among genes - a fact known as epistasis; 
hence, linear gene-phenotype models are inadequate. The occurrence of a disease can be posed as a 
(logistic) multilinear regression, where apart from single-gene terms, the output depends on products of 
two or more genes as well ||9). To cope with the under-determinacy of the problem and detect gene-gene 
interactions, sparsity -promoting logistic regression methods have been developed; see e.g., ||27l . 

Based on these considerations, exploiting sparsity in polynomial representations is well motivated and 
prompted us to develop the sparsity-aware estimators described in the following section. 

III. Estimation of Sparse Polynomial Expansions 

One of the attractive properties of Volterra and polynomial regression models is that the output is a 
linear function of the wanted coefficients. This allows one to develop standard estimators for h in (0]). 
However, the number of coefficients M can be prohibitively large for reasonable values of P and L, even 
after removing redundancies. Hence, accurately estimating h requires a large number of measurements N 
which: i) may be impractical and/or violate the stationarity assumption in an adaptive system identification 
setup; ii) entails considerable computational burden; and iii) raises numerical instability issues. To combat 
this curse of dimensionality, batch sparsity-aware methods will be proposed first for polynomial modeling, 
and based on them, adaptive algorithms will be developed afterwards. 

A. Batch Estimators 

Ignoring v in (|4]), the vector h can be recovered by solving the linear system of equations y = Xh. 
Generally, a unique solution is readily found if N > M; but when N < M, there are infinitely many 
solutions. Capitalizing on the sparsity of h, one should ideally solve 

min {||h|| : y = Xh} . (7) 
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Recognizing the NP-hardness of solving (O, compressive sampling suggests solving instead the linear 
program J8], (6] 

min {||h||i : y = Xh} (8) 

h 

which is also known as basis pursuit and can quantifiably approximate the solution of ©; see Section JV] 
for more on the relation between (|7) and (H). However, modeling errors and measurement noise, motivate 
a LS estimator h LS := argmhih ||y — Xh|||. If N > M and X has full column rank, the LS solution is 
uniquely found as h LS = (X T X) 1 X T y. If the input is drawn either from a continuous distribution or 
from a finite alphabet of at least P+l values, X T X is invertible almost surely; but its condition number 
grows with L and P |fT9l . A large condition number translates to numerically ill-posed inversion of X T X 
and amplifies noise too. If N < M, the LS solution is not unique; but one can choose the minimum 
^2-norm solution h LS = X T (XX r ) _1 y. 

For both over/under-determined cases, one may resort to the ridge (^-norm regularized) solution 

h Rid9e := (X T X + 5I M ) ~* X T y (9a) 
= X T (XX T + ( 5I JV )^ 1 y (9b) 

for some 5 > 0, where the equality can be readily proved by algebraic manipulations. Calculating, storing 
in the main memory, and inverting the matrices in parentheses are the main bottlenecks in computing 
Y^Ridge v j a q choosing (|9al ) versus (|9bl depends on how N and M compare. Especially for polynomial 
(or Volterra) regression, the (ni,7i2)-th entry of XX T , which is the inner product x T (rii)x(ri2), can 
be also expressed as J2p=o ( x r( n i) x i( n 2)) P - This computational alternative is an instantiation of the 
so-called kernel trick, and reduces the cost of computing XX T in (|9bl from 0(N 2 M) to 0{N 2 (L + P)) 
(231, CD; see also Subsection UTECl 

In any case, neither h LS nor h Rld 9 e are sparse. To effect sparsity, the idea is to adopt as regularization 
penalty the ^i-norm of the wanted vector ||25| 

1 M 
h = argmin-||y - Xh||| + Ajv^Wi|/ii| ( 10 ) 

i=l 

where hi is the i-th entry of h, and Wi > for i = 1, . . . , M. Two choices of Wi are commonly adopted: 
(wl) Wi = 1 for i = 1, . . . , M, which corresponds to the conventional Lasso estimator 11251 : or, 
(w2) Wi = \hf ldge \~ 1 for i = 1, . . . , M, which leads to the weighted Lasso estimator ||32l . 
Asymptotic performance of the Lasso estimator has been analyzed in ifTOl . where it is shown that the 
weighted Lasso estimator exhibits improved asymptotic properties over Lasso at the price of requiring the 
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ridge regression estimates to evaluate the Wi's IT321 . For the practical finite-sample regime, performance 
of the Lasso estimator is analyzed through the restricted isometry properties of X in Section [TV] where 
rules of thumb are also provided for the selection of A^r as well (cf. Lemma [TJ. 

Albeit known for linear regression models, the novelty here is the adoption of (weighted) Lasso for 
sparse polynomial regressions. Sparse generalized linear regression models , such as i\ -regularized logistic 
and probit regressions can be fit as a series of successive Lasso problems after appropriately redefining 
the response y and weighting the input X |[T3l Sec. 4.4.1], ETl . Hence, solving and analyzing Lasso 
for sparse polynomial expansions is important for generalized polynomial regression as well. Moreover, 
in certain applications, Volterra coefficients are collected in subsets (according to their order or other 
criteria) that are effected to be (non)zero as a group 024). In such applications, using methods promoting 
group-sparsity is expected to improve recoverability [ 301 - Even though sparsity is manifested here at the 
single-coefficient level, extensions toward the aforementioned direction constitutes an interesting future 
research topic. 

Algorithmically, the convex optimization problem in (fTOT i can be tackled by any generic second-order 
cone program (SOCP) solver, or any other method tailored for the Lasso estimator. The method of choice 
here is the coordinate descent scheme of lILTI . which is outlined next for completeness. The core idea is 
to iteratively minimize (fTOT i w.r.t. one entry of h at a time, while keeping the remaining ones fixed, by 
solving the scalar minimization problem 

min-Hy-X^h^-x^l^ + AjvWil/til (11) 

hi Z 

where Xj is the z-th columrQof X, variables X( _l ) and h( _l ) denote X and h, respectively, having the z-th 
column (entry) removed, and h is the latest value for the optimum h. It turns out that the component-wise 
minimization of (TTTT > admits the closed-form solution |[T2l 

t sign(zi) 

hi i [\zi\ - \NWi\ + (12) 

where [x] , := max(x, 0), Ra is the i-th entty of the sample correlation or Grammian matrix R := X T X 
and Zi is the i-th entry of Zj := X T — X.(~'^h(~' 1 ^ . After initializing h to any value (usually zero), 
the algorithm iterates by simply updating the entries of h via (fT2l . By defining z := X T — Xh^, 
vector Zj can be updated as 

Zi <- z + ri'hi (13) 

'Recall that x(n) stands for the n-th row of X. 
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with Ti being the i-th column of R. After updating hi to its new value (fl2l) . z has to be updated too as 

z^Zi-rihi. (14) 

It is easy to see that {zi}f£ 1 in (fT3T)-(fT4l are not essentially needed, and one can update only z. These 
iterates constitute the cyclic coordinate descent (CCD) algorithm for the (weighted) Lasso problem, and 
are tabulated as Alg. Q] CCD-(W)L is guaranteed to converge to a minimizer of (ITOb lTT2l . Apart from the 
initial computation of z and R which incurs complexity 0(M 2 N), the complexity of Alg. Q]as presented 
here is O(M) per coordinate iteration; see also lILTI . 

B. Recursive Estimators 

Unlike batch estimators, their recursive counterparts offer computational and memory savings, and 
enable tracking of slowly time-varying systems. The recursive LS (RLS) algorithm is an efficient imple- 
mentation of the LS, and the ridge estimators. It solves sequentially the following problem: 

N 

h% LS := argmin^^-" (y(n) - x T (n)h) 2 + (3 N 6\\h\\ 2 2 (15) 

h n=l 

where /3 denotes the forgetting factor and 5 a small positive constant. For time-invariant systems, /3 is 
set to 1, while <C /3 < 1 enables tracking of slow variations. Similar to the batch LS, the RLS does 
not exploit the a priori knowledge on the sparsity of h, and suffers from numerical instability especially 
when the effective memory of the algorithm, 1/(1 — /3), is comparable to the dimension M of h. 
To overcome these limitations, the following approach is advocated for polynomial regression: 

= argmin J^(h) (16) 

h 

N M 

4(h) := Y,P N ~ n (VW ~ x T (n)h) 2 + A^u^N 

n=l i=l 

where wn,{ can be chosen as 

(al) wn t i = 1 \/N, i = 1, . . . , M, which corresponds to the recursive Lasso (RL) problem; or, 
(a2) wn,i = (h^fl -1 VA r , i = 1, . . . , M, leading to the recursive weighted Lasso (RWL) one. 
The sequence {liAr} cannot be updated recursively, and ([Tol l calls for a convex optimization solver for 
each time instant or measurement N. To avoid the computational burden involved, several methods have 
been developed for sparse linear models; see [Q] and the references therein. The coordinate descent 
algorithm of Subsection IIII-AI can be extended to ( fl6l ) by first updating R and z as 

Rtv = PRn-i + x(AQx T (A0 (17a) 
z N = 0zjv_i + x(N){y(N) - x T (AT)h JV _ 1 ) ( nb) 



January 13, 2013 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (REVISED) 



1 1 



where hjv_i is a solution at time N — l. The minimizer h^v can then be found by performing component- 
wise minimizations until convergence in the spirit of the corresponding batch estimator. However, to 
speed up computations and leverage the adaptivity of the solution, we choose to perform a single cycle 
of component-wise updates. Thus, h/v is formed by the iterates of the inner loop in Alg. |2] where rjv,i, 
zn,i, Rjv,m> and fiN,i are defined as before. 

The presented algorithm called hereafter cyclic coordinate descent for recursive (weighted) Lasso 
(CCD-R(W)L) is summarized as Alg. |2j the convergence properties of CCD-RL have been established 
in 12 for linear regression, but carry over directly to the polynomial regression considered here. Its 
complexity is 0(M 2 ) per measurement which is of the same order as the RLS. By setting i«/v,i = or 
wn,i = l^wi | _1 > the CCD-R(W)L algorithms approximate the minimizers of the R(W)L problems. 

C. Polynomial Reproducing Kernels 

An alternative approach to polynomial modeling is via kernel regression ll23l . In the general setup, ker- 
nel regression approximates a nonlinear function /(xi) assuming it can be linearly expanded over a possi- 
bly infinite number of basis functions </>fc(xi) as /(xi) = Ylk=i a k4>k{^-i)- When </>fc(xi) = k (xi, xi(fc)) 
with «(•,•) denoting a judiciously selected positive definite kernel, /(xi) hes in a reproducing kernel 
Hilbert space %, and kernel regression is formulated as the variational problem 

min C ({/ (xi(n)) , y(n)}*U) + \\f\\ H (18) 

where C(-) is an arbitrary cost function, and \\f\\n is the norm in % that penalizes complexity of /. It 
turns out that there exists a minimizer of (fLSl l expressed as /(xi) = Xm=i a n K ( Xl > x i( n ))' while for 
many meaningful costs the a n 's can be computed in 0(N 3 ) using convex optimization solvers ||23l . 

Polynomial regression can be cast as kernel regression after setting Ac(xi(ni), xi(ri2)) to be either the 
homogeneous polynomial kernel (x^(ni)xi(ra2)) P , or, one of the inhomogeneous ones (l+x^(ni)xi(n2)) 
or ^2p =0 (x^(ni)xi(n2)) P |[23l , |[TTI . Once the a n 's have been estimated, the polynomial coefficients h 
(cf. (01)) can be found in closed form ffTTI . Furthermore, objectives C(-) such as the e-insensitive cost, 
yield sparsity in the a n -domain, and thus designate the so-called support vectors among the xi(n)'s |[23l . 
Even though kernel regression alleviates complexity concerns, the h which can indirectly obtained cannot 
be sparse. Thus, sparsity-aware estimation in the primal h-domain (as opposed to the dual a„-domain) 
comes with interpretational and modeling advantages. 
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IV. IDENTIFIABILITY OF SPARSE POLYNOMIAL MODELS 

This section focuses on specifying whether the optimization problems in ([8]) and (fTOb are capable 
of identifying a sparse polynomial expansion. The asymptotic in iV behavior of the (weighted) Lasso 
estimator has been studied in |[T0l , ||32l ; practically though one is more interested in finite-sample 
recoverability guarantees. One of the tools utilized to this end is the so-called restricted isometry properties 
(RIP) of the involved regression matrix X. These are defined as 0: 

Definition 1 (Restricted Isometry Properties (RIP)). Matrix X 6 M iVxM possesses the restricted isometry 
of order s, denoted as 5 S G (0, 1), if for all h £ ~R M with ||h||o < s 

(i-cy Hi < ||xh|H < (i + <y ||h|||. d9) 

RIP were initially derived to provide identifiability conditions of an s-sparse vector h D given noiseless 
linear measurements y = Xh . It has been shown that the £o-P seu donorm minimization in ((T) can 
uniquely recover h if and only if £ 2s < 1. If additionally 5 2s < V2 — 1, then h G is the unique minimizer 
of the basis pursuit cost in ([8]) 0. 

RIP-based analysis extends to noisy linear observations of an s-sparse vector; that is, for y = Xh D + v. 
If 1 1 v 1 1 2 < e, the constrained version of the Lasso optimization problem 

min {||h||i : ||y - Xh|| 2 < e} (20) 

h 

yields \\Iibn — h ||| < c 2 BN ■ e 2 , where cbn '■= x^^^/I+i) whenever < V% — 1 0. Furthermore, 
if v ~ jV(0, <t 2 Itv), the Dantzig selector defined as 

min {||h|| i: ||X T (y - Xh) IU < e D5 } (21) 

satisfies ||h_Ds — h ||| < cds • cr 2 s log M, where cds '■= s^f%,+\^ j w ^ P rorj ability at least 1 — 
(•7rlogM) -1//2 whenever 82s < V% — L and e^s = V^crvTogM [7]. Similarly, RIP-based recoverability 
guarantees can be derived in the stochastic noise setting for the Lasso estimator as described in the 
following lemma. 

Lemma 1. Consider the linear model y = Xh D + v, where the columns of X £ R NxM are of unit 
l2-norm, ||h ||o = s, and v ~ A/"(0, 0" 2 Iyv). Let h.i denote the minimizer of the Lasso estimator (110b 
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with Wi = 1 for i = 1 



,M, and A = Aa^logM for A > 2y/2. If 5 2s < 



3^+1 



, the bounds 



hL - h Q ||i < 



ISA 



(22) 



• as 



h L - Kg < - 
\ ' 

X(h L -h )||2 < 




■ a 2 slog M 



■ a 2 slog M 



(24) 



(23) 



/ 8 for c L = (1-Szs) (l 



l-S 2s 



) 



2 



hold with probability at least 1 — M 



Proof: The lemma follows readily by properly adapting Lemma 4.1 and Theorem 7.2 of EJ. ■ 
The earlier stated results document and quantify the role of RIP-based analysis in establishing identi- 
fiability in a compressive sampling setup. However, Definition Q] suggests that finding the RIP of a given 
matrix X is probably a hard combinatorial problem. Thus, to derive sparse recoverability guarantees one 
usually resorts to random matrix ensembles to provide probabilistic bounds on their RIP |6), |[22l . In 
the generic sparse linear regression setup, it has been shown that when the entries of X G jg^xM 
independently Gaussian or Bernoulli, X possesses RIP 5 S with probability at least 1 — exp (— 6 2 /(2C)) 
when the number of measurements is N > 2C/5 2 • slog(M/s), where C is a universal constant; this 
bound is known to be optimal [6j. In a sparse system identification setup where the regression matrix 
has a Toeplitz structure, the condition on the number of measurements N obtained so far loosens to 
a scaling of s 2 logM for a Gaussian, Bernoulli, or uniform input lTT4l . 11221 . The quadratic scaling of 
N w.r.t. s in the latter bound versus the linear scaling in the former can be attributed to the statistical 
dependencies among the entries of X |[22l . Our contribution pertains to characterizing the RIP of the 
involved regression matrix for both the Volterra system identification and the multivariate polynomial 
regression scenarios. 

A. RIP for Volterra System Identification 

For the Volterra filtering problem under study, the following assumptions will be in force: 
(asl) input {x n } is independently drawn from the uniform distribution, i.e., x n ~ U[— 1, 1]; and 
(as2) expansion is of order P = 2 (linear-quadratic Volterra model). 
Regarding (asl), recall that the Volterra expansion is a Taylor series approximation of a nonlinear function; 
thus, it is reasonable to focus on a bounded input region. Moreover, practically, one is frequently interested 
in the behavior of a nonlinear system for a limited input range. For (as2), the non-homogeneous quadratic 
Volterra model is a commonly adopted one. Generalization to models with P > 3 is not straightforward 
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and goes beyond the scope of our RIP analysis. The considered Volterra filter length is M = ( L J 2 ) ; and, 
for future use, it is easy to check that under (asl) it holds that E[x 2 ] = 1/3 and E[ar„] = 1/5. 

To start, recall the definition of the Grammian matrix R := X T X and let Rij denote its (i,j)-th 
entry. As shown in lfT4l Sec. Ill], the matrix X possesses RIP S s if there exist positive 6d and S with 
&d + &o = &s such that \Ra — 1| < 5d and < S a /s for every i, j with j ^ i. When these conditions 
hold, Gersgorin's disc theorem guarantees that the eigenvalues of Grammian matrices formed by any 
combination of s columns of X lie in the interval [1 — S s , 1 + S s ], and X possesses RIP 5 S by definition. 
In a nutshell, for a regression matrix X to have small <5 s 's, and hence favorable compressed sampling 
properties, it suffices that its Grammian matrix has diagonal entries close to unity and off-diagonal entries 
close to zero. If the involved regression matrix X had unit ^-norm columns, then the {Ru} would be 
unity by definition and one could merely study the quantity maxjj \ Rij\, defined as the coherence of 
X; see also ll22l p. 13] for the relation between coherence and the RIP. 

In the Volterra filtering problem at hand, the diagonal entries {Ru} are not equal to one; but an 
appropriate normalization of the columns of X can provide at least E[i?jj] = 1 for all i. The law of 
large numbers dictates that given sufficiently enough measurements N, the Ra's will approach their 
mean value. Likewise, it is desirable for the off-diagonal entries of R to have zero mean, so that they 
vanish for large N. Such a requirement is not inherently satisfied by all Rifs with j ^ i; e.g., the inner 



product between X columns of the form [a 



l T 



and [x 2 n _ 



k X n-k+l 



X 



n-k+N~l] 



for some n and k > has expected value N (E[x^]) that is strictly positive. 

To achieve the desired properties, namely 

(pi) E[Ru] = 1 for all i = 1, . . . , M, and 

(p2) E[Rij] = for all i, j = 1, . . . , M and j ^ i 
it will be soon established that instead of studying the RIP of X, one can equivalently focus on its 
modified version Xel M defined as 



X 



x 



(25) 



X' X 9 X 6 

where x c := ljv/V^/V corresponds to the constant (intercept or dc) component, X' and X c/ are two N x L 
Toeplitz matrices corresponding to the linear and quadratic parts defined as 

Xq X-i . . . X-L+l 
Xi X ... X_L +2 



X' :-- 



XN-l X N -2 



XN-L+1 



(26) 
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2 V N 



-L+2 



s iV-l 3 X 



2 1 
3 



(27) 



• ^AT-L+l 3 

and X 6 is a iV x L( - L ~ 1 - > (non-Toeplitz) matrix related to the bilinear part given by 



X° 



XqX-1 
XlXQ 



X\X-\ 



X-L+2X-L+1 
X-L+3X-L+2 



(28) 



XN-l x N-2 XN-lXN-3 ■ ■ ■ XN-L+2XN-L+1 

Consider now the Grammian of X, namely R := X T X. Comparing X with X, the columns of X 
have their ^-norm normalized in expectation, and thus R satisfies (pi). Moreover, those columns of X 
corresponding to the quadratic part (cf. submattix X 9 ) are shifted by the variance of x n . One can readily 
verify that (p2) is then satisfied too. 

The transition from X to X raises a legitimate question though: Does the RIP of X provide any insight 
on the compressed sampling guarantees for the original Volterra problem? In the noiseless scenario, we 
actually substitute the optimization problem in © by 



min < ||h| 
h L 



Xh 



}■ 



Upon matching the expansions Xh = Xh, the following one-to-one mapping holds 

L 



h 



1 



o 



o 



k=i 



3 r 



hi(k) = \lj^h 1 (k), k = l,...,L 



h 2 (k, k) 
h 2 (ki, k 2 ) 



j^~h 2 (k,k), k = l,...,L 
h 2 {ki,k 2 ), h = l,...,L, k 2 = h + l,. 



(29) 

(30a) 

(30b) 

(30c) 
(30d) 



It is now apparent that a sparse solution of (1291 translates to a sparse solution of © except for the 
constant term in (I30ab . By deterministically adjusting the weights {wi}^ and the parameter Xn in (TTOt . 
this argument carries over to the Lasso optimization problem and answers affirmatively the previously 
posed question. Note though that such a modification serves only analytical purposes; practically, there 
is no need to solve the modified compressed sampling problems. 
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Remark 1. Interestingly, transition from the original Volterra matrix to the modified one resembles the 
replacement of the Volterra by the Wiener polynomials for nonlinear system identification |[T6l . Wiener 
polynomials are known to facilitate mean-square error (MSE)-optimal estimation of Volterra modules 
for a white Gaussian input; see e.g., |[T6l . Our modification, adjusted to a uniformly distributed input, 
facilitates the RIP analysis of the Volterra regression matrix. 

One of the main results of this paper is summarized in the following theorem (see the Appendix for 
a proof). 

Theorem 1 (RIP in Volterra Filtering). Let {xi}fL_ L+1 be an input sequence of independent random 
variables drawn from U[—l, 1], and define M := (L + 1)(L + 2)/2. Assume that the N x M modified 
Volterra regression matrix X defined in (I25l )- (128l ) is formed by such an input for L > 7 and N > 160. 
Then, for any 5 S G (0, 1) and for any 7 £ (0, 1), whenever N > jrz^xjp " s<2 1°6 L, the matrix X possesses 
RIP 5 S for s > 2 with probability exceeding 1 — exp (^—-jf- ■ pQ, where C = 2, 835. 

The theorem asserts that an order s 2 log L observations suffice to recover an s-sparse non-homogeneous 
second-order Volterra filter of memory L probed by a uniformly distributed input scales as s 2 log L. Since 
the number of unknowns M is 0{L 2 ), the bound on iV scales also as s 2 logM. The bound agrees with 
the bounds obtained for the linear filtering setup lTT4l . whereas now the constants are larger due to the 
more involved dependencies among the entries of the associated regression matrix. 

B. RIP for Multivariate Polynomial Regression 

Consider now the case where /(x) describes a sparse linear-quadratic model 

L L L 

/(xx) = h + hi(k)xk + y~] ^2(^1,^2) (31) 

fc=l fel=l fc 2 =fcl 

Given N output samples {y(n)}^ =1 , corresponding to input data {xi(n)}^ =1 drawn independently from 
U [—1, 1] L , the goal is to recover the sparse M x 1 vector h comprising the hi(k)'s and ^2(^1, A^'s. 
Note that M = (L + 1)(L + 2)/2 here. As explained in Section HT1 the noiseless expansion in (|3TT > can 
be written as y = Xh; but, contrary to the Volterra filtering setup, the rows of X are now statistically 
independent. The last observation differentiates significantly the RIP analysis for polynomial regression 
and leads to tighter probabilistic bounds. 

Our analysis builds on ||22| . which deals with finding a sparse expansion of a function /(x) = 
Y^t=i c tV't( x ) over a bounded orfhonormal set of functions {V't(x)}. Considering V a measurable space, 



January 13, 2013 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (REVISED) 



17 



V 



e.g., a measurable subset of M. L endowed with a probability measure u, the set of functions {^(x) : 
V — > M.}f =1 is a bounded orthonormal system if for all ti, t% = 1, . . . , T 

^ti( x )V>t 2 ( x )<M x ) = <*ti,* a (32) 
where 5^ ,t 2 denotes the Kronecker delta function, and for some constant K > 1 it holds that 

sup sup iV'tWI < (33) 

After sampling /(x) at {x(n) G 2?}^L 1; the involved N x T regression matrix \l/ with entries t := 
(x(n)) admits the following RIP characterization ll22l Theorems 4.4 and 8.4]. 



Theorem 2 (RIP in bounded orthonormal systems 11221 ). Let \l/ be the N x T matrix associated with a 
bounded orthonormal system with constant K > 1 in d33b - 77ie?i, /or any <5 S G (0, 0.5], there exist universal 
positive constants C and 7, such that whenever N > • s log 4 T, f/ie matrix possesses RIP 5 S 

with probability exceeding 1 — exp ^—^^2 ■ y 

In the linear-quadratic regression of (|3H . even though the basis functions {1, {xj}, {x^x^}} are 
bounded in [— 1, 1] L , they are not orthonormal in the uniform probability measure. Fortunately, our input 
transformation trick devised for the Volterra filtering problem applies to the polynomial regression too. 
The expansion is now over the basis functions {^ m (x)} m=1 

MV&Ci}, (xl - I ,{3^x i2 }} (34) 

where the last subset contains all the unique, two-variable monomials lexicographically ordered. Upon 
stacking the function values {y n }^ =1 in y and properly defining h, the expansion y = Xh can be 
replaced by y = Xh, where the entries of X are 

A n,m •— 7= • (J->J 

VN 



Vectors h and h are related through the one-to-one mapping in (l30l : thus, sparsity in one is directly 
translated to the other. Identifiability of a sparse h can be guaranteed by the RIP analysis of X presented 
in the next lemma. 

Lemma 2 (RIP in linear-quadratic regression). Let Xj(n) for i = 1, . . . ,L and n = 1, . . . , N independent 
random variables uniformly distributed in [—1, 1], and define M := (L + 1)(L + 2)/2. Assume that the 
N x M modified polynomial regression matrix X in (1351 ) is generated by this sequence for L > 4. Then, for 
any 5 S G (0, 0.5], there exist universal positive constants C and 7, such that whenever N > 14 ^ c ■ s log L, 
the matrix X possesses RIP 5 S with probability exceeding 1 — exp ( — ^ • ^ 
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Proof: The inputs x(n) are uniformly drawn over T> = [— 1,1] L , and it is easy to verify that the 
basis functions {^> m (x)}^f =1 in (l34l form a bounded orfhonormal system with K = 3. Hence, Theorem 
|2] can be straightforwardly applied. Since M < L 2 for L > 4, it follows that log 4 M < 16 log 4 L. ■ 

Lemma [2] assures that an s-sparse linear-quadratic L-variate expansion with independent uniformly 
distributed inputs can be identified with high probability from a minimum number of observations that 
scales as s log 4 L or s log 4 M. Comparing this to Theorem [TJ the bound here scales linearly with s. 
Moreover, except for the increase in the power of the logarithmic factor, the bound is close to the one 
obtained for random Gaussian and Bernoulli matrices. The improvement over the Volterra RIP bound is 
explained by the simpler structural dependence of the matrix X involved. 

Another interesting polynomial regression paradigm is when the nonlinear function /(xi) admits a 
sparse polynomial expansion involving L inputs, and all products up to P of these inputs, that is 

L L L 

/(xi) = hp + y~] ht(k)x k + y~] ^2 h 2 {k 1 ,k 2 )x kl x k2 + ... (36) 

k=l k 1 =lk 2 =k 1 +l 
L L L 

+ ... ^ hp(ki, &2, • • • , kp)xi tl Xk 2 ■ ■ ■ x kp ■ 

fcl = l fc 2 =fcl+l fcp=fcj3_l+l 

This is the typical multilinear regression setup appearing in GWA studies E71 . ll9l . Because there are 
(p) monomials of order p, the vector h comprising all the expansion coefficients has dimension 

M = T( L ) <{L + lf (37) 

where the last inequality provides a rough upper bound. The goal is again to recover an s-sparse h 
given the sample phenotypes {y n }n=i over me genotype values {xi(n)}^ =1 . Vectors xi(n) are drawn 
either from { — 1,0, 1} L or {—1, 1} L depending on the assumed genotype model (additive for the first 
alphabet; and dominant or recessive for the latter) ||27| . Without loss of generality, consider the ternary 
alphabet with equal probabilities. Further, suppose for analytical convenience that the entries of xi(n) 
are independent. Note that the input has mean zero and variance 2/3. 

The RIP analysis for the model in d36l) exploits again Theorem |2] Since now every single input appears 
only linearly in (l36l . the basis functions {1, {x^}, {x^x^}, . . .} are orthogonal w.r.t. the assumed point 
mass function. A bounded orthonormal system {tp m (x)}^ =1 can be constructed after scaling as 

{l, {(2/3)- 1 / 2 x il }, {{2/3)- 2 / 2 x ilXt2 }, {(2/3)- p / 2 x tl x i2 ■ ■ ■ x tp }] (38) 

while the set is bounded by K = (3/2) p / 2 . Similar to the linear-quadratic case in (|3ll , the original 
multilinear expansion Xh is transformed to Xh, where X is defined as in (1331 ) with the new basis of 
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(I38T ). and h is an entry-wise rescaled version of h. Based on these facts, the RIP characterization of X 
follows readily from the ensuing lemmaJfl 

Lemma 3 (RIP in multilinear expansion). Let Xi(n) for i = 1, . . . , L and n = 1, . . . , N independent 
random variables equiprobably drawn from { — 1, 0, 1}, and M defined as in J37T) . The N x M modified 
multilinear regression matrix X in d35l ) and ( 1381 ) is generated by this sequence. Then, for any S s £ (0, 0.5], 
there exist universal positive constants C and 7, such that whenever N > S P 4 slog 4 (L + 1), the 
matrix X possesses RIP 5 S with probability exceeding 1 — exp (— c ^ s 2 y ' t)- 

Since P is often chosen in the order of 2 due to computational limitations, Lemma [3] guarantees the 
RIP to hold with high probability when the number of phenotype samples N scales at least as s log 4 L. 

V. Simulated Tests 

The RIP analysis performed in the previous section provides probabilistic bounds on the identifiability 
of sparse polynomial representations. In this section, we evaluate the applicability of sparsity-aware 
polynomial estimators using synthetic and real data. The experimental results indicate that sparsity- 
promoting recovery methods attain accurate results even when the number of measurements is less than 
the RIP-derived bounds, and, in any case, they outperform the sparsity-agnostic estimators. 

A. Batch and Adaptive Volterra Filters 

We first focus on the sparse Volterra system identification setup. The system under study was an LNL 
one, consisting of a linear filter with impulse response hj = [0.36 0.91 0.19] T , in cascade with the 
memoryless nonlinearity f(x) = — 0.5x 3 + OAx 2 + x, and the same linear filter. This system is exactly 
described by a Volterra expansion with L = 11 and P = 3, leading to a total of M = ( L p P ) = 364 
coefficients collected in the vector ho- Out of the 364 coefficients only 48 are nonzero. The system input 
was modeled as x(n) ~ A/"(0, 1), while the output was corrupted by additive noise v(n) ~ A/"(0,0.1). 
First, the batch estimators of Section IIII-AI were tested, followed by their sequential counterparts. 

, averaged over 100 Monte Carlo runs, is plotted against 



In Fig. 1(a) the obtained MSE, E 



lho-h||i 



the number of observations, N, for the following estimators: (i) the ridge estimator of © with 5=1; 
(ii) the Lasso (CCD-L) estimator with Aat=0.7V^V; and, (iii) the weighted Lasso (CCD-WL) estimator 

2 After our conference precursor [1151 . we became aware of a recent result in 1181 . which relates to Lemma [3] The differences 
are: i) only the P-th order term in expansion l !36> is considered in jlgl; and ii) inputs {xi(n)} adhere to the binary {±1} 
alphabet in 1181 , as opposed to the ternary one in Lemma [3] 
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with Aat=0.08 log N. The scaling rules for the two Aats follow the results of 0]] and (32]. It can be seen 
that the sparsity-agnostic ridge estimator is outperformed by the Lasso estimator for short observation 
intervals (iV<600). For larger N, where X T X becomes well-conditioned, the former provides improved 
estimation accuracy. However, CCD-WL offers the lowest MSE for every N, and provides reasonably 
accurate estimates even for the under-determined case (iV<364). 

Performance of the sequential estimator in Section IIII-B I was assessed in the same setup. Fig. |l(b)| 
illustrates the MSE convergence, averaged over 100 Monte Carlo runs, for the following three recursive 
algorithms: (i) the conventional RLS of ( 031 ); (ii) the cyclic coordinate descent recursive Lasso (CCD-RL); 
and, (iii) its weighted version (CCD-RWL). Since the system was time-invariant, the forgetting factor 
was set to /3 = 1. It can be observed that the conclusions drawn for the batch case carry over to the 
recursive algorithms too. Moreover, a comparison of Figs. 1(a) and |l(b)| indicates that the sparsity-aware 
iterates of Table |2] approximate closely the exact per time instance problem in ([Tol l. 



B. Multilinear Regression for GWA Analysis 

Here we test sparse polynomial modeling for studying the epistatic effects in quantitative trait analysis. 
In quantitative genetics, the phenotype is a quantitative trait of an organism, e.g., the weight or height of 
barley seeds ll26l . Ignoring environmental effects, the phenotype is assumed to follow a linear regression 
model over the individual's genotype, including single-gene (main) and gene-gene (epistatic) effects |[28l . 
0. The genotype consists of markers which are samples of chromosomes taking usually binary {±1} 
values. Determining the so-called quantitative trait loci (QTL) corresponds to detecting the genes and 
pairs of genes associated with a particular trait ll28l . Since the studied population N is much smaller 
than the number of regressors M, and postulating that only a few genotype effects determine the trait 
considered, QTL analysis falls under the sparse multilinear (for P = 2) model of (l36l ). 

1) Synthetic Data: The first QTL paradigm is a synthetic study detailed in ll28l . A population of 
iV=600 individuals is simulated for a chromosome of 1800 cM (centiMorgan) evenly sampled every 15 
cM to yield L = 121 markers. The true population mean and variance are 5.0 and 10.0, respectively. The 
phenotype is assumed to be linearly expressed over the intercept, the L main effects, and the (£) = 7, 260 
epistatic effects, leading to a total of M = 7, 382 regressors. The QTLs simulated are 9 single markers and 
13 marker pairs. Note that the simulation accommodates markers (i) with main only, (ii) epistatic only, 
and (iii) both main and epistatic effects. Since the intercept is not regularized, genotype and phenotype 
data were centered, i.e., their sample mean was subtracted, and the intercept was determined at the end 
as the sample mean of the initial I/O data on the fitted model. 
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Parameters 5 and A for ridge and (w)Lasso estimators, respectively, were tuned through 10-fold cross- 
validation over an 100-point grid lTT3l : see Table |l(a)| The figure of merit for selecting the parameters 
was the prediction error (PE) over the unseen data, i.e., Y^v=i lly« ~~ ^v^vWi/iN/V), where V = 10 
and h„ is the regression vector estimated given all but the (y t ,,X„) validation data. The value of 5 
attaining the smallest PE was subsequently used for determining the weights for the wLasso estimator. 
Having tuned the regularization parameters, the MSE provided by the three methods was averaged over 
100 Monte Carlo runs on different phenotypic data while keeping the genotypes fixed. The (w)Lasso 
estimators were run using the glmnet software Ifl2l . Each of the three algorithms took less than 1 min 
and 1 sec for cross-validation and final estimation, respectively. 



As can be seen from Table |I(a)[ Lasso attains the smaller PE. However, wLasso provides significantly 
higher estimation accuracy at a PE value comparable to Lasso. The number of non-zero regression 
coefficients indicated in the fourth column shows that ridge regression yields an over-saturated model. 
As shown more clearly in Fig. |2] where the true and the estimated models are plotted, the wLasso yields 
a sparser, closer to the true model, while avoiding some spurious coefficients found by Lasso. 

2 ) Real data from a barley experiment: The second QTL experiment entails a real dataset collected 
by the North American Barley Genome Mapping Project as described in Eoll . 1291 , and outlined shortly 
next. Aiming at a GWA analysis on barley height (HGT), the population consists of ^=145 doubled- 
haploid lines of a cross between two barley lines, Harrington and TR306. The height of each individual 
was measured under 27 different environments, and the phenotype was taken to be the sample average. 
There are L = 127 markers covering a 1270 cM segment of the genome with an average marker interval 
of 10.5 cM. The genotype is binary: +1 (-1) for the TR306 (Harrington) allele. There is a 5% of missing 
values which are modeled as zeros in order to minimize their effect fl28l . The main and epistatic QTL 
analysis involves M = 1 + 127+ = 8, 129 regressors. 

The regularization parameter values were selected through leave-one-out cross-validation lTT3l : see 



Table |I(b)| The ridge estimator fails to handle over-fitting and 5 is set to a large value yielding regression 
coefficients of insignificant amplitude. Using the ridge estimates to weight the regression coefficients, 
wLasso yields a PE slighty smaller than the one attained by Lasso; but it reduces the spurious coefficients. 
As shown in Fig. [3j wLasso provides a more parsimonious model with fewer spurious peaks than the 
Lasso-inferred model. Closer investigation of the wLasso QTLs exceeding 0.1 in magnitude, shown in 



Table |I(c)| offers the following interesting observations: (i) epistatic effects are not negligible; (ii) there 
are epistatic effects related to QTLs with main effects, e.g., the (35,99) pair is related to marker (101); 
(iii) there are epistatic effects such as the (9, 33) one involving markers with no main effect. 



January 13, 2013 



DRAFT 



IEEE TRANSACTIONS ON SIGNAL PROCESSING (REVISED) 



22 



VI. Conclusions 



The idea of exploiting sparsity in the representation of a system, already widely adopted for linear 
regression and system identification, has been permeated here to estimate sparse Volterra and polynomial 
models. The abundance of applications allowing for an interpretative parsimonious polynomial expansion 
and the inability of kernel regression to yield such an expansion necessitate sparsity-aware polynomial 
estimators. This need was successfully met here both from practical and analytical perspectives. Algo- 
rithmically, the problem was solved via the batch (weighted) Lasso estimators, where for the weighted 
one, the weights were efficiently found through the kernel trick. To further reduce the computational and 
memory load and enable tracking, an adaptive sparse RLS-type algorithm was devised. On the analytical 
side, RIP analysis was carried out for the two models. It was shown that an s-sparse linear-quadratic 
Volterra filter can be recovered with high probability using measurements in the order of s 2 log L; a 
bound that interestingly generalizes the results from the linear filtering problem to the Volterra one. For 
the sparse polynomial expansions considered, the bound improved to s log 4 L, which also generalizes the 
corresponding linear regression results. The potential of the aforementioned sparse estimation methods was 
numerically verified through synthetic and real data. The developed sparse adaptive algorithms converged 
fast to the exact solution, while the (weighted) Lasso estimators outperformed the LS -based one in all 
simulated scenarios, as well as in the GWA study on real barley data. Future research directions include 
extending the bounds derived to higher-order models, and utilizing our adaptive methods to accomplish 
epistatic GWA studies on the considerably higher dimensional human genome. 



Outlining some tools regarding concentration inequalities precede the proof of Theorem Q] 

Lemma 4 (Hoeffding's inequality). Given t > and independent random variables {xi}^L 1 bounded as 
a-i < Xi < bi almost surely, the sum sn ■= Yli=i x i satisfies 



It is essentially a Chernoff-type result on the concentration of a sum of independent bounded random 
variables around its mean. However, the subsequent analysis on the RIP of the Volterra filter considers 
sums of structurally dependent random variables. Useful probability bounds on such sums can be derived 
based on the following lemma. 



Appendix 




(39) 
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Lemma 5 (Hoeffding's inequality with dependent summands J2T|). Consider random variables {xi}f =l 
bounded as a < x% < b almost surely. Assume also they can be partitioned into M collectively exhaustive 
and mutually exclusive subsets {A/" m }^f =1 with respective cardinalities {N m }^ =1 such that the variables 
within each subset are independent. Then, for any t > the sum sn '■= ^2iL\ x i satisfies 



where N min := min m {iV m }. 

Note that the sharpness of the bound in (l40l depends on the number of subsets M as well as the 
minimum of their cardinalities N m - m . One should not only strive for the minimum number of intra- 
independent subsets, but also arrange iV m 's as uniformly as possible. For example, partitioning with the 
minimum number of subsets may yield N m [ n = 1 that corresponds to a loose bound. 

The partitioning required in Lemma [5jis not always easy to construct. An interesting way to handle this 
construction is offered by graph theory as suggested in [ETI . The link between structural dependencies 
in a set of random variables {xi}^ and graph theory hinges on their dependency graph G. The latter is 
defined as the graph having one vertex per Xi, and an edge between every pair of vertices corresponding 
to dependent Xj's. Recall that the degree of a vertex is the number of edges attached to it, and the degree 
of a graph A(G) is the maximum of the vertex degrees. Finding group-wise statistical independence 
among random variables can be seen as a coloring of the dependency graph. The problem of coloring 
aims at assigning every vertex of a graph to a color (class) such that there are no adjacent vertices 
sharing the same color. Moreover, coloring of a graph is equitable if the cardinality of every color does 
not differ by more than one from the cardinalities of every other color. Thus, an M-equitable coloring of 
the dependency graph means that the random variables can be partitioned in M intra-independent subsets 
whose cardinalities are either I or I J^J + 1. A key theorem by Hajnal and Szemeredi guarantees that 
a graph G has an M-equitable coloring for all M > A(G) + 1; see e.g., ll2lTl . Combining this result with 
Lemma [51 yields the following corollary. 

Corollary 1 (Hoeffding's inequality and dependency graph ll2lTl . Ifl4l0 . Consider random variables 
{xi}^ =l bounded as a < x-i < b. Assume also that their dependency graph has degree A. Then, the 
sum sn := £^ =1 x^ satisfies for every integer M > A + 1 and t > 




(40) 




(41) 



Having presented the necessary tools, the proof of Theorem Q] is presented next. 
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Proof of Theorem \J} Consider a specific realization of X and its Grammian R. As guaranteed by 
the Gersgorin disc theorem, if \Ru — 1| < 5d and \Rij\ < 5 /s for every i, j with j ^ i while 5d + S = 5 
for some 5 E (0, 1), then matrix X possesses RIP 5 S < 5 |[T4l . Thus, the probability of X not satisfying 
RIP of value 5 can be upper bounded as 



' M mm ^ 



Pr (5 S > 5) < Pr 



(42) 



\j{\Ru-l\>6 d } or[JU N>7 

. 8=1 8=1 j=\ ^ > 

Apparently, the events in the right-hand side (RHS) of (l42l are not independent. Exploiting the symmetry 



of R, the union bound can be applied for only its lower triangular part yielding 

M M M , - . 

Pr (6 S >5)< J^Pr (|^ -l\>S d )+J2 E Pr ( 1^' - f ) ■ (43) 

i=2 i=l j=i+l ^ ' 

Our next goal is to upper bound the probabilities appearing in the RHS of d43l) . Different from the 
analysis in lfT4l for the linear case, the entries of R exhibit different statistical properties depending 
on the components (constant, linear, quadratic, bilinear) of the nonlinear system they correspond to. To 
signify the difference, we will adopt the notation i??- instead of Rij, where a and /3 can be any of 
{c, I, q, b}, to indicate that the entry Rfj is the inner product between the i-th and the j-th columns of 
X, but also the i-th(j-th) column comes from the a(/3) part of the system. For example, the element lif- 
is the inner product of a column of X 9 with a column of X^. Recall also that R satisfies the requirements 
E[Ru] = 1 and E[^] = for j / i. 

We start with the L diagonal entries Rf { , where each one of them can be expressed as A Ylk=i x n-k 
for some n. Upon recognizing this quantity as a sum of N independent random variables confined in 
the interval [0, , Hoeffding's lemma can be readily applied. The bound obtained is multiplied by L 



to account for all i?"'s; hence 



L+1 ' 2NS 2 



i=2 



£ Pr ~ 1| > 5d) < 2Lexp -* . (44) 



Similarly, each one of the L diagonal entries Rff is equal to ^ SfcLi { x n-k ~ I) 2 f° r some n > which 
is a sum of N independent random variables bounded in [0, JM. Lemma |4] yields 



2L+1 / 2N5 2 



E Pr (l^ " !| ^ ^) < 2 ^exp ) . (45) 

i=L+2 



Before proceeding with the bilinear diagonal entries, let us consider first the off-diagonal entries Rfj. 
Each one of them is a sum of the form ^ Ylk=l x n~kX n ~m-k for m ^ n. However, the summands 
are not generally independent; every summand is a two-variable monomial and a single x n may appear 
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in two summands. This was handled in lfl4l after proving that Rf, can always be split into two partial 
sums, each including independent terms. As a clarifying example, the entry R l 2 l 3 can be expressed as 
jj [(xqX-i + X2X1 + ...) + (x±Xo + X3X2 + • • .)]• Moreover, the two partial sums contain [yj and |~y] 
summands. Applying Lemma [5] for t = 6 Q /s, M = 2, N m i n = |_y_|, an ^ ^ = ~ a = %/N, it follows that 



> — ) 



N 
~2 



18s 2 

Taking into account that [^J > ^ for N > 160, and since there are L(L — l)/2 < L 2 /2 off-diagonal 

?" 



R\\ terms, their collective probability bound is 



j=i+l 



Returning to the bilinear diagonal entries, every R%f can be written as J| SfcLi x n-k x n-m-k f° r some 
m 7^ 0. Even though the summands are not independent, they exhibit identical structural dependence 
observed in R^s; thus, the same splitting trick can be applied here too. Upon using Lemma[5]for t = 5d, 
M = 2, N min = |_f J, a = 0, and b = 9/N, and adding the contribution of all L(L - l)/2 < L 2 /2 
bilinear diagonal entries, we end up with 

£ Pr - 1| > 5 d ) < 21} exp (-^) ■ (48) 



i=2L+2 



Regarding the entries Rf- and R c ^-, an immediate application of Hoeffding's inequality yields 



L+l 



£Pr(|i^.|>^)<2Lex P (-§) (49) 

i=2 



— 1 <U .„ / SiW) 2 

45s 2 



E P r (|flg|>^)<2Lexp(-^) (50) 

j=L+2 



whereas the probabilities Pr M-Rfjl > <WsJ have been already accounted for in the analysis of the Rfj's. 
The entries R 1 ? can be written as %^ J2k=i x n-k {x 2 l _ k _ m — 5) for some n and m, where every 



summand lies in 



TV ' AT 



Two sub-cases will be considered. The first corresponds to the L entries 
R l ? with m = (or equivalently j = i + L), in which every summand depends on a single input. Through 
LemmaH] the sum of probabilities related to these L entries is upper bounded by 2L exp(— N5 2 /(30s 2 )). 
The second case includes the remaining (L 2 — L) entries with m ^ 0, for which the splitting trick can 
be applied to yield the bound 4(L 2 — L) exp (— [A r /2j(5 2 /(30s 2 )). Combining the two bounds yields 

E E*(i^7> 4 M-^)' <51) 

i=2 j=L+2 v 7 x 7 
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The Rfj entries can be expressed as ^ J2k=i( x n-k ~ |)( x n-fc-m ~~ |) f° r some m 7^ 0, where each 
summand is bounded in [—5^, ^77] • Exploiting the same splitting trick and summing up the contributions 
of all the L(L — l)/2 RfJ entries, yields 



i=L+2 j=i+l x 7 v 

The iS's can be written as the sum ^^Y^k=i x n-kX n -k-mX n -k-p for some n and m 7^ p, while 



every summand lies in 



3v^3 3\/3 
N ' JV 



Note that there exist ici-'s with summands being two-input mono- 



mials, i.e., for m = or p = 0. However, to simplify the presentation, the derived bound is slightly 
loosened by considering all Rfj's as sums of three-input monomials. This specific structure precludes 
the application of the splitting procedure into two halves, and necessitates use of the dependency graph. 
It can be shown that the degree of the dependency graph associated with the three-variable products for 
any R 1 ^ entry is at most 6. Then, application of Corollary [TJover the L 2 (L — l)/2 < L 3 /2 RVj entries 
together with the inequality [N/7\ > N/8, which holds for N > 160, yield 

t E * (i*5i ^)*"-«p <53) 

i=2 j=2L+2 v 7 v 7 

The Rjj's can be written as Ylk=i ( x n-k ~ I) x n-k-mX n -k-p f° r some n and m 7^ p, where the 
summands lie in — . Following a reasoning similar to the one for R 1 ^, 



E £ »{\*$\*%) <54) 

j=L+2 j=2L+2 



Finally, the fl^'s are expressed as ^ SfcLi ^n-^n-fe-m^-t-p^n-fc-m-,) for some n, m, p, and 
whereas the summands lie in [—771 77] • For any Ixf- entry, the summands are four-input monomials, and 
thus, the degree of the associated dependency graph is at most 12. Upon applying Corollary [TJover the 
L(L - 1)(L 2 — L — 2)/8 R^'s, and since [A^/13j > JV/14 for N > 160, we obtain 

M M 

2268s 2 

i=2L+2 j=i+l x 

Adding together the bounds for the diagonal elements (1441 . (|45T ), and d48l ), implies 

^TPr(|/^-l| >5 d ) <3L 2 expf-^] (56) 

i=2 ^ ' 

for L > 7. For the off-diagonal elements, upon adding d47j, (|49^- (|55) . it follows for L > 7 that 

j=2 j=i+l v 7 v 7 
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By choosing ^ = ^w^, the arguments of the exponentials in (l56l ) and (l57l i become equal, and after 



adding the two bounds, we arrive at 

sAW 7r 4 ™ ( 

2268s 2 



Pr (6 S >5)< 7L 4 exp ( ) . (58) 



Since <5 = (5^ + <5 G translates to <5 2 = ( s V^l 3 _ \ ^ > fj.85 2 for s > 2, the bound in d58l > simplifies to 

\sy56/3+l y 

Pr „>,),^e Xp (-^) £ e Xp (-^(^-^ 1 o gi )). TO 



Now set C:=2,835 and choose any 7 G (0, 1). Whenever iV > (jz^ ' « 2 logL, ([59]> yields 

j5 2 AT 

~~c ' 7 

which completes the proof. 



Pt(6 8 >6) <exp(-^-- 
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Algorithm 1 CCD-(W)L 



Algorithm 2 CCD-R(W)L 



Initialize z = X T y. 
Compute matrix R = X T X. 
repeat 

for i = 1, . . . ,M do 

Update z as z = z + Vihi. 
Update hi using dl2l l. 
Update z as z = z — nhi. 
end for 
until convergence of h. 



Initialize ho = Om, zo = Om, Ro = <5Im- 
for N= 1,2, ... do 

Update Rjv and zjv via ( fTTal and l fl7bl 

for i = 1, .. . ,M do 

Zjv = Zjv + rjv,i/lJV-l,i 



N,i 



■ign(zjy,i) 



Ujv,j 



Ajvwjv.i] 



Zjv = Zjv 

end for 
end for 



TN,ihN,i 




Fig. 1. MSE of (a) batch and (b) adaptive Volterra estimators. 
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*10O 



(a) True model 



(b) Ridge regression 





(c) Lasso (d) wLasso 

Fig. 2. Regression vector estimates for the synthetic gene data. The main (epistatic) effects are shown on the diagonal (left 
diagonal part), while red (green) bars correspond to positive (negative) entries. 




(a) Lasso (b) wLasso 

Fig. 3. Regression vector estimates for the real QTL barley data. The main (epistatic) effects are shown on the diagonal (left 
diagonal part), while red (green) bars correspond to positive (negative) entries. 
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TABLE I 

Experimental results for synthetic and real QTL data 

(a) Synthetic data (b) Real QTL barley data 

Method | PE | MSE | NNZ | SIX \ | Method | PE | NNZ | SIX 

Ridge 68.10 82.29 7382 0.61 N Ridge 826 8129 4.28-10 4 N 

Lasso 12.84 15.85 200 0.19 N Lasso 5.96 48 0.33 N 

wLasso 13.09 5.11 85 3.77 TV wLasso 5.69 34 6.88 N 

(c) QTLs estimated by wLasso for the real 



barley data 



Main effects 


Epistatic effects 


Marker 


Value 


Markers 


Value 


(12) 


+0.78 


(7,66) 


+0.19 


(53) 


-0.18 


(9,33) 


-0.29 


(61) 


+0.23 


(20,95) 


+0.13 


(101) 


+0.40 


(33,88) 


+0.10 


(104) 


+0.24 


(35,99) 


-0.47 


(112) 


+0.43 


(38,52) 


-0.15 






(56,92) 


+0.38 






(63,81) 


-0.19 
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